feat(e2e): Add ECS E2E testing infrastructure and test suites by JLineaweaver · Pull Request #44061 · DataDog/datadog-agent

JLineaweaver · 2025-12-10T12:39:30Z

What Does This PR Do?

Migrates ECS E2E tests from test/new-e2e/tests/containers/ecs_test.go into a dedicated test/new-e2e/tests/ecs/ package with a shared base suite, stronger validation, and CI integration.

What changed

New ECS test package (test/new-e2e/tests/ecs/):

base.go + base_helpers.go — Shared test suite with AssertMetric, AssertLog, AssertAPMTrace, AssertAgentHealth, AssertCheckRun helpers (1026 lines)
18 tests across 4 test files, each validating specific metrics, tags, or trace structures against regex patterns

Suite	Tests	What it validates
`checks_test.go`	5	Nginx, Redis, Prometheus autodiscovery with full tag validation (20+ regex patterns per check, metric + log tags) on EC2 and Fargate
`apm_test.go`	6	DogStatsD (UDS/UDP) with 23-tag regex sets, trace collection (UDS/TCP) with 13-pattern bundled tag validation
`platform_test.go`	4	Windows Fargate, BottleRocket, CPU value range validation, container lifecycle
`managed_test.go`	3	Agent health, trace collection with bundled ECS metadata on managed instances

E2E framework changes (test/e2e-framework/):

Added tracegen ECS app definition for managed instance testing
Added WithLinuxBottleRocketNodeGroup() and WithManagedInstanceNodeGroup() ECS options
Added ECS container instance wait utility
Updated agent, nginx, redis, aspnetsample ECS app configs (DD_SERVICE/DD_ENV/DD_VERSION labels)

Containers package cleanup (test/new-e2e/tests/containers/):

Deleted old ecs_test.go (677 lines, migrated to new package)
Removed unused testCheckRun types and function from base_test.go

CI integration:

Added new-e2e-ecs job in .gitlab/test/e2e/e2e.yml
Updated CODEOWNERS for test/new-e2e/tests/ecs/

Infrastructure config

BottleRocket node group enabled in platform suite
FakeIntake retention set to 31 minutes across all suites (prevents data loss during long test runs)
Shared base suite with helpers that validate tags via regex, send Datadog events on pass/fail, and support filtering

Key design decisions

Every test validates specific values via regex, not just tag prefix existence
Trace validation uses bundled _dd.tags.container format (DD_APM_ENABLE_CONTAINER_TAGS_BUFFER=true)
Infrastructure readiness tests (Test00UpAndRunning) run first alphabetically to warm up the agent

Motivation

The old ECS tests lived in the containers package and lacked rigorous validation. This migration creates a dedicated package with focused tests where each passing test gives real confidence in specific metrics, tags, and traces.

Possible Drawbacks / Trade-offs

Suite provisions real AWS infrastructure (ECS clusters, EC2 instances, Fargate tasks) — costs money and takes time
Tests depend on specific workload apps (tracegen, dogstatsd, nginx, redis, stress-ng, prometheus) being deployed correctly

Additional Notes

See test/new-e2e/tests/ecs/README.md for full documentation including running instructions and coverage matrix.

…suites Implements JIRA ticket EXP-133: Add e2e testing to the agent for ECS environments. This PR adds comprehensive E2E test coverage for all three ECS deployment scenarios (Fargate, EC2, and Managed Instances) across APM/Tracing, Logs, Configuration/Discovery, and Resilience/Error handling. ## New Test Suites (48+ tests total): - test/new-e2e/tests/ecs/apm_test.go: 8 APM/tracing tests - Basic trace collection, multi-service tracing, sampling, tag enrichment - Trace-log correlation, Fargate and EC2 specific scenarios - test/new-e2e/tests/ecs/logs_test.go: 9 log collection tests - Container log collection, multiline handling, JSON parsing - Log sampling, filtering, source detection, status remapping, trace correlation - test/new-e2e/tests/ecs/config_test.go: 7 configuration tests - Env var configuration, Docker label discovery, task definition discovery - Dynamic configuration, metadata endpoints, service discovery, config precedence - test/new-e2e/tests/ecs/resilience_test.go: 8 resilience tests - Agent restart recovery, task failure recovery, network interruption handling - High cardinality, resource exhaustion, rapid container churn, large payloads, backpressure - test/new-e2e/tests/ecs/managed_test.go: 12 managed instance tests - Managed instance specific functionality validation ## Test Applications: - test/e2e-framework/components/datadog/apps/ecs-multiservice/: 3-tier distributed tracing app - test/e2e-framework/components/datadog/apps/ecs-log-generator/: Comprehensive log generation app - test/e2e-framework/components/datadog/apps/ecs-chaos/: Chaos engineering app with 7 failure modes ## Infrastructure Changes: - test/e2e-framework/resources/aws/ecs/nodeGroups.go: Added NewManagedNodeGroup() for managed instances - test/e2e-framework/scenarios/aws/ecs/args.go: Added WithManagedInstanceNodeGroup() option - test/e2e-framework/scenarios/aws/ecs/cluster.go: Integrated managed instance provisioning ## Enhanced Test Framework: - test/new-e2e/tests/containers/base_test.go: Added 4 new test helper methods - testAPMTrace(): Comprehensive trace validation framework - testLogPipeline(): Enhanced log validation framework - testAgentHealth(): Agent readiness checks - testResilienceScenario(): Chaos testing framework - Exported BaseSuite[Env] type for cross-package usage - test/new-e2e/tests/containers/ecs_test.go: Enhanced existing tests + 4 new tests - TestMetadataCollection(): ECS metadata validation - TestContainerLifecycle(): Container tracking - TestTagInheritance(): Tag consistency across telemetry - TestCheckAutodiscovery(): Redis/Nginx autodiscovery ## Test Organization: All new ECS-specific tests are organized in test/new-e2e/tests/ecs/ directory for better maintainability and clarity.

agent-platform-auto-pr · 2025-12-10T13:18:16Z

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor f0789c2
📊 Static Quality Gates Dashboard

Successful checks

Info

	Quality gate	Change	Size (prev → curr → max)
✅	agent_deb_amd64	N/A	N/A → 748.124 → 754.830
✅	agent_deb_amd64_fips	N/A	N/A → 708.180 → 715.320
✅	agent_heroku_amd64	N/A	N/A → 325.593 → 329.530
✅	agent_msi	N/A	N/A → 659.743 → 1072.620
✅	agent_rpm_amd64	N/A	N/A → 748.108 → 754.800
✅	agent_rpm_amd64_fips	N/A	N/A → 708.164 → 715.310
✅	agent_rpm_arm64	N/A	N/A → 727.194 → 737.340
✅	agent_rpm_arm64_fips	N/A	N/A → 689.739 → 698.930
✅	agent_suse_amd64	N/A	N/A → 748.108 → 754.800
✅	agent_suse_amd64_fips	N/A	N/A → 708.164 → 715.310
✅	agent_suse_arm64	N/A	N/A → 727.194 → 737.340
✅	agent_suse_arm64_fips	N/A	N/A → 689.739 → 698.930
✅	docker_agent_amd64	N/A	N/A → 810.611 → 817.140
✅	docker_agent_arm64	N/A	N/A → 814.292 → 824.020
✅	docker_agent_jmx_amd64	N/A	N/A → 1001.523 → 1008.020
✅	docker_agent_jmx_arm64	N/A	N/A → 993.986 → 1003.620
✅	docker_cluster_agent_amd64	N/A	N/A → 180.838 → 181.200
✅	docker_cluster_agent_arm64	N/A	N/A → 196.683 → 198.490
✅	docker_cws_instrumentation_amd64	N/A	N/A → 7.135 → 7.180
✅	docker_cws_instrumentation_arm64	N/A	N/A → 6.689 → 6.920
✅	docker_dogstatsd_amd64	N/A	N/A → 38.414 → 39.380
✅	docker_dogstatsd_arm64	N/A	N/A → 36.749 → 37.940
✅	dogstatsd_deb_amd64	N/A	N/A → 29.630 → 30.610
✅	dogstatsd_deb_arm64	N/A	N/A → 27.802 → 29.110
✅	dogstatsd_rpm_amd64	N/A	N/A → 29.630 → 30.610
✅	dogstatsd_suse_amd64	N/A	N/A → 29.630 → 30.610
✅	iot_agent_deb_amd64	N/A	N/A → 42.751 → 43.290
✅	iot_agent_deb_arm64	N/A	N/A → 39.872 → 40.920
✅	iot_agent_deb_armhf	N/A	N/A → 40.442 → 41.030
✅	iot_agent_rpm_amd64	N/A	N/A → 42.751 → 43.290
✅	iot_agent_suse_amd64	N/A	N/A → 42.751 → 43.290

On-wire sizes (compressed)

	Quality gate	Change	Size (prev → curr → max)
✅	agent_deb_amd64	N/A	N/A → 182.826 → 184.810
✅	agent_deb_amd64_fips	N/A	N/A → 174.313 → 177.560
✅	agent_heroku_amd64	N/A	N/A → 87.288 → 88.450
✅	agent_msi	N/A	N/A → 142.414 → 143.300
✅	agent_rpm_amd64	N/A	N/A → 185.807 → 188.160
✅	agent_rpm_amd64_fips	N/A	N/A → 176.734 → 178.900
✅	agent_rpm_arm64	N/A	N/A → 168.401 → 169.930
✅	agent_rpm_arm64_fips	N/A	N/A → 160.789 → 163.120
✅	agent_suse_amd64	N/A	N/A → 185.807 → 188.160
✅	agent_suse_amd64_fips	N/A	N/A → 176.734 → 178.900
✅	agent_suse_arm64	N/A	N/A → 168.401 → 169.930
✅	agent_suse_arm64_fips	N/A	N/A → 160.789 → 163.120
✅	docker_agent_amd64	N/A	N/A → 275.093 → 277.400
✅	docker_agent_arm64	N/A	N/A → 262.663 → 266.040
✅	docker_agent_jmx_amd64	N/A	N/A → 343.749 → 346.020
✅	docker_agent_jmx_arm64	N/A	N/A → 327.316 → 330.660
✅	docker_cluster_agent_amd64	N/A	N/A → 63.868 → 64.510
✅	docker_cluster_agent_arm64	N/A	N/A → 60.136 → 61.170
✅	docker_cws_instrumentation_amd64	N/A	N/A → 2.994 → 3.330
✅	docker_cws_instrumentation_arm64	N/A	N/A → 2.726 → 3.090
✅	docker_dogstatsd_amd64	N/A	N/A → 14.863 → 15.820
✅	docker_dogstatsd_arm64	N/A	N/A → 14.202 → 14.830
✅	dogstatsd_deb_amd64	N/A	N/A → 7.831 → 8.790
✅	dogstatsd_deb_arm64	N/A	N/A → 6.718 → 7.710
✅	dogstatsd_rpm_amd64	N/A	N/A → 7.843 → 8.800
✅	dogstatsd_suse_amd64	N/A	N/A → 7.843 → 8.800
✅	iot_agent_deb_amd64	N/A	N/A → 11.211 → 12.040
✅	iot_agent_deb_arm64	N/A	N/A → 9.584 → 10.450
✅	iot_agent_deb_armhf	N/A	N/A → 9.782 → 10.620
✅	iot_agent_rpm_amd64	N/A	N/A → 11.230 → 12.060
✅	iot_agent_suse_amd64	N/A	N/A → 11.230 → 12.060

The ecs-chaos and ecs-multiservice apps were using incorrect types from the awsx/ecs package (ecs.TaskDefinitionVolumeArray/Args) which don't exist. Fixed by importing the classic ECS package and using the correct types: - classicECS.TaskDefinitionVolumeArray - classicECS.TaskDefinitionVolumeArgs This matches the pattern used in other ECS apps (dogstatsd, tracegen, nginx).

cit-pr-commenter · 2025-12-10T13:31:52Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: dd763ea2-f8bb-4c47-9338-0c1213f8af86

Baseline: d5fa254
Comparison: 194b80d
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	docker_containers_cpu	% cpu utilization	-2.68	[-5.65, +0.29]	1	Logs

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	quality_gate_metrics_logs	memory utilization	+1.33	[+1.11, +1.55]	1	Logs bounds checks dashboard
➖	file_tree	memory utilization	+0.89	[+0.83, +0.95]	1	Logs
➖	ddot_metrics	memory utilization	+0.78	[+0.56, +1.00]	1	Logs
➖	quality_gate_logs	% cpu utilization	+0.67	[-0.82, +2.16]	1	Logs bounds checks dashboard
➖	ddot_metrics_sum_cumulativetodelta_exporter	memory utilization	+0.33	[+0.09, +0.56]	1	Logs
➖	tcp_syslog_to_blackhole	ingress throughput	+0.29	[+0.21, +0.38]	1	Logs
➖	otlp_ingest_logs	memory utilization	+0.19	[+0.09, +0.30]	1	Logs
➖	ddot_metrics_sum_delta	memory utilization	+0.15	[-0.07, +0.36]	1	Logs
➖	ddot_metrics_sum_cumulative	memory utilization	+0.11	[-0.05, +0.28]	1	Logs
➖	docker_containers_memory	memory utilization	+0.08	[+0.01, +0.15]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	+0.07	[-0.35, +0.49]	1	Logs
➖	file_to_blackhole_500ms_latency	egress throughput	+0.04	[-0.33, +0.41]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	+0.03	[-0.11, +0.17]	1	Logs
➖	quality_gate_idle_all_features	memory utilization	+0.02	[-0.02, +0.05]	1	Logs bounds checks dashboard
➖	tcp_dd_logs_filter_exclude	ingress throughput	+0.01	[-0.07, +0.08]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	-0.01	[-0.42, +0.40]	1	Logs
➖	uds_dogstatsd_to_api_v3	ingress throughput	-0.02	[-0.15, +0.11]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	-0.07	[-0.12, -0.02]	1	Logs
➖	quality_gate_idle	memory utilization	-0.13	[-0.18, -0.08]	1	Logs bounds checks dashboard
➖	uds_dogstatsd_20mb_12k_contexts_20_senders	memory utilization	-0.17	[-0.23, -0.12]	1	Logs
➖	ddot_logs	memory utilization	-0.28	[-0.35, -0.21]	1	Logs
➖	otlp_ingest_metrics	memory utilization	-0.47	[-0.63, -0.32]	1	Logs
➖	docker_containers_cpu	% cpu utilization	-2.68	[-5.65, +0.29]	1	Logs

Bounds Checks: ✅ Passed

perf	experiment	bounds_check_name	replicates_passed	links
✅	docker_containers_cpu	simple_check_run	10/10
✅	docker_containers_memory	memory_usage	10/10
✅	docker_containers_memory	simple_check_run	10/10
✅	file_to_blackhole_0ms_latency	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency	lost_bytes	10/10
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10
✅	file_to_blackhole_100ms_latency	lost_bytes	10/10
✅	file_to_blackhole_100ms_latency	memory_usage	10/10
✅	file_to_blackhole_500ms_latency	lost_bytes	10/10
✅	file_to_blackhole_500ms_latency	memory_usage	10/10
✅	quality_gate_idle	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_idle	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_logs	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_logs	lost_bytes	10/10	bounds checks dashboard
✅	quality_gate_logs	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_metrics_logs	cpu_usage	10/10	bounds checks dashboard
✅	quality_gate_metrics_logs	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_metrics_logs	lost_bytes	10/10	bounds checks dashboard
✅	quality_gate_metrics_logs	memory_usage	10/10	bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.

Consolidates all ECS-specific tests from containers/ folder into the new test/new-e2e/tests/ecs/ directory, removing duplicates and improving organization. ## Changes ### New Test Files (506 lines) - **checks_test.go** (306 lines): Check autodiscovery and execution tests - TestNginxECS, TestRedisECS (EC2 with autodiscovery) - TestNginxFargate, TestRedisFargate (Fargate check execution) - TestPrometheus (Prometheus/OpenMetrics integration) - **platform_test.go** (200 lines): Platform-specific feature tests - TestWindowsFargate (Windows container support on Fargate) - TestCPU (CPU metrics with value validation) - TestContainerLifecycle (Container lifecycle tracking) ### Cleanup (821 lines removed) - **containers/ecs_test.go**: Reduced from 983 to 162 lines - Removed 8 unique tests (moved to ecs/ folder) - Removed 8 duplicate tests (covered by new ecs tests) - Kept Test00UpAndRunning (foundation test) - Added package documentation explaining its role ### Test Classification **Moved to ecs/ folder** (8 tests): - TestNginxECS, TestRedisECS, TestNginxFargate, TestRedisFargate, TestPrometheus → checks_test.go - TestWindowsFargate, TestCPU, TestContainerLifecycle → platform_test.go **Removed as duplicates** (8 tests): - TestDogtstatsdUDS/UDP - Covered in apm_test.go - TestTraceUDS/TCP - Covered in apm_test.go (TestAPMEC2) - TestMetadataCollection - Covered in config_test.go (TestMetadataEndpoints) - TestTagInheritance - Covered in config_test.go and logs_test.go - TestCheckAutodiscovery - Covered in config_test.go ## Test Organization After Changes test/new-e2e/tests/ ├── containers/ │ └── ecs_test.go (162 lines - foundation test only) └── ecs/ ├── README.md (755 lines - comprehensive documentation) ├── apm_test.go (416 lines - 8 tests) ├── logs_test.go (482 lines - 9 tests) ├── config_test.go (543 lines - 7 tests) ├── resilience_test.go (458 lines - 8 tests) ├── managed_test.go (527 lines - 12 tests) ├── checks_test.go (306 lines - 5 tests) ← NEW └── platform_test.go (200 lines - 3 tests) ← NEW ## Benefits - Better test organization by functionality - No duplicate test coverage - All ECS-specific tests in one dedicated directory - Foundation test remains in containers/ for infrastructure validation - 52 total ECS tests across 7 suites Relates to JIRA ticket EXP-133.

Fixed incorrect closing brace (}) to closing parenthesis ()) in WithECSOptions() calls in three test files: - apm_test.go - config_test.go - logs_test.go This resolves the linter failures that prevented compilation.

…g framework Added release note documenting the new ECS E2E testing infrastructure including 7 test suites, 3 test applications, and support for all ECS deployment types (Fargate, EC2, Managed Instances).

…iences Updated all ownership references across ECS test infrastructure: - test/new-e2e/tests/ecs/README.md - test/e2e-framework/components/datadog/apps/ecs-multiservice/ (3 files) - test/e2e-framework/components/datadog/apps/ecs-chaos/ (3 files) - test/e2e-framework/components/datadog/apps/ecs-log-generator/ (4 files) The correct team name is "ecs-experiences" not "containers/orchestrator".

Fixed linter errors in ECS test files: - Changed suite.baseSuite to suite.BaseSuite in all SetupSuite methods - Removed duplicate getKeys function from logs_test.go - Removed duplicate getMapKeys function from resilience_test.go All tests now compile successfully.

- Created base_helpers.go with all Test*Args types and helper methods - Moved BaseSuite to base.go for external test access - Fixed containers package compilation (now compiles successfully) - Updated ECS test files to package ecs and import containers - Fixed time.Minute/time.Second references - Fixed pb.Span type references Note: Some compilation errors remain in ECS test files that need individual attention (GetMetrics/GetLogs method calls, lo.Filter usage)

- Created helpers.go with getAllMetrics() and getAllLogs() functions - Fixed pb.Span pointer type in apm_test.go - Fixed aggregator.TracePayload types in lo.Filter calls - Replaced all GetMetrics()/GetLogs() calls with helper functions - Fixed metric.GetMetricName() → metric.Metric - Fixed log.GetSource() → log.Source Remaining issues to fix: - Missing helper functions: getKeys, filterLogsByTag, getTagValue, truncateString - Fix metric.Resources field access - Fix log.GetMessage() → log.Message

…ith helper functions This completes the refactoring to make ECS tests work after fakeintake's GetMetrics() and GetLogs() methods were made private. Changes: - Created getAllMetrics() and getAllLogs() helper functions in helpers.go that use public APIs (GetMetricNames + FilterMetrics, GetLogServiceNames + FilterLogs) - Added utility helper functions: getKeys, getMapKeys, filterLogsByTag, getTagValue, truncateString - Replaced all suite.Fakeintake.GetMetrics() calls with getAllMetrics(suite.Fakeintake) - Replaced all suite.Fakeintake.GetLogs() calls with getAllLogs(suite.Fakeintake) - Fixed method calls to use direct field access: * metric.GetMetricName() → metric.Metric * log.GetSource() → log.Source * log.GetMessage() → log.Message * log.GetStatus() → log.Status * log.GetTimestamp() → log.Timestamp * metric.Resources[0].Timestamp → metric.GetCollectedTime().Unix() - Fixed suite.Fakeintake.FlushData() → FlushServerAndResetAggregators() - Fixed type mismatches: * spansByID map to use *pb.Span pointers * lo.Filter calls to use *aggregator.TracePayload * getMapKeys calls to use getKeys for map[string]bool - Removed unused imports (aggregator, fakeintake, time) Result: Both containers and ecs packages now compile successfully. Related: EXP-133

Update README examples to use current fakeintake API: - Replace GetMetrics() with getAllMetrics() helper function - Replace GetLogs() with getAllLogs() helper function - Replace FlushData() with FlushServerAndResetAggregators() - Replace m.GetMetricName() with m.Metric - Update time constants to use time.Minute instead of suite.Minute These changes ensure the documentation matches the current implementation after the fakeintake API was refactored to make methods private. Related: EXP-133

…older This commit addresses the concern that we made too many changes to the containers folder, which we don't own. Changes: 1. Moved base.go and base_helpers.go from containers/ to ecs/ - Changed package declaration from 'containers' to 'ecs' - Added assertTags() function (previously in utils.go) - ECS tests now use their own BaseSuite without importing containers 2. Updated all ECS test files to remove containers import - Changed containers.BaseSuite → BaseSuite - Changed &containers.TestXXX → &TestXXX - All helper types now available directly in ecs package 3. Restored containers/ folder to origin/main state - Reverted all modifications to docker_test.go, eks_test.go, etc. - Reverted base_test.go to its original state - Removed base.go and base_helpers.go from containers/ - Containers package restored to pre-refactoring state Result: - ✅ ECS package compiles with its own BaseSuite - ✅ Containers package compiles with original structure - ✅ No cross-dependency between packages - ✅ Minimal changes to containers/ folder (owned by another team) Related: EXP-133

…kage This commit completes the migration of all ECS-specific tests from the containers package to the dedicated ecs package, ensuring clean ownership boundaries between teams. Changes: - Moved Test00UpAndRunning infrastructure test to ecs/apm_test.go * This foundation test waits for all ECS tasks to be ready * Acts as warmup for agent tagger before other tests run * Renamed Test00AgentAPMReady to Test01AgentAPMReady to maintain correct execution order (infrastructure check runs first) - Moved DogStatsD transport tests to ecs/apm_test.go * TestDogtstatsdUDS - DogStatsD over Unix Domain Socket * TestDogtstatsdUDP - DogStatsD over UDP * testDogstatsd helper method - Moved Trace transport tests to ecs/apm_test.go * TestTraceUDS - Tracing over Unix Domain Socket * TestTraceTCP - Tracing over TCP * testTrace helper method - Deleted test/new-e2e/tests/containers/ecs_test.go entirely * All unique tests have been migrated to ecs/ * All duplicate tests were already covered by new ecs/ tests * Containers team no longer has ECS test ownership Result: - Clean separation: containers/ package has NO ECS tests - All 52+ ECS tests now consolidated in ecs/ package - ECS team owns all ECS tests in one location - Both packages compile successfully Testing: Verified both packages compile with appropriate build tags

Applied fixes from Go code reviewer analysis to improve code quality and address critical issues identified in apm_test.go. Changes: 1. Fix context leak (Critical) - Changed Test00UpAndRunning to use suite.T().Context() instead of context.Background() - Ensures proper cancellation when test completes or times out - Prevents goroutine leaks from AWS SDK operations 2. Extract duplicated tag validation patterns (Major) - Created getCommonECSTagPatterns() helper function to centralize tag pattern definitions - Reduces ~60 lines of duplicated code between testDogstatsd and testTrace methods - Improves maintainability by having single source of truth for ECS tag expectations - Supports both full tag sets (metrics) and minimal sets (traces) 3. Fixed terminology in comment - Updated testTrace comment from "container and pod tags" to "container and ECS task tags" (ECS uses tasks, not pods) Benefits: - Eliminates potential goroutine leaks from long-running AWS operations - Centralizes tag validation logic for easier updates - Reduces code duplication and maintenance burden - Makes tag expectations consistent across test methods Testing: Verified package compiles successfully with Go build tags

- Run gofmt on all ECS test files - Add package comment to ecs package - Remove ineffectual assignment in TestTraceSampling - Remove unused baseSuite type alias from ecs/base.go - Remove unused getMapKeys function from ecs/helpers.go - Remove unused testCheckRun code from containers/base_test.go

- Remove TestECSSuite from new-e2e-containers job (test was deleted) - Add new-e2e-ecs job to run all 7 ECS test suites in parallel - Update skip pattern to remove ECS (now has dedicated job) - Target: ./tests/ecs with team ecs-experiences Test suites: - TestECSAPMSuite (APM/tracing) - TestECSLogsSuite (log collection) - TestECSConfigSuite (configuration) - TestECSResilienceSuite (resilience) - TestECSManagedSuite (managed instances) - TestECSChecksSuite (check autodiscovery) - TestECSPlatformSuite (platform-specific)

agent-platform-auto-pr · 2025-12-12T10:40:55Z

Gitlab CI Configuration Changes

Added Jobs

new-e2e-ecs

new-e2e-ecs:
  after_script:
  - CODECOV_TOKEN=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $CODECOV token) || exit
    $?; export CODECOV_TOKEN
  - $CI_PROJECT_DIR/tools/ci/junit_upload.sh "junit-${CI_JOB_ID}.tgz" "$E2E_RESULT_JSON"
  - "if [ -d \"$E2E_COVERAGE_OUT_DIR\" ]; then\n  dda inv -- -e coverage.process-e2e-coverage-folders\
    \ $E2E_COVERAGE_OUT_DIR\n  dda inv -- -e dyntest.compute-and-upload-job-index\
    \ --bucket-uri $S3_PERMANENT_ARTIFACTS_URI --coverage-folder $E2E_COVERAGE_OUT_DIR\
    \ --commit-sha $CI_COMMIT_SHA --job-id $CI_JOB_ID\nfi\n"
  artifacts:
    expire_in: 2 weeks
    paths:
    - $E2E_OUTPUT_DIR
    - $E2E_RESULT_JSON
    - junit-*.tgz
    - $E2E_COVERAGE_OUT_DIR
    reports:
      annotations:
      - $EXTERNAL_LINKS_PATH
    when: always
  before_script:
  - mkdir -p $GOPATH/pkg/mod/cache && zstd -dc modcache_e2e.tar.zst | tar xf - -C
    $GOPATH/pkg/mod/cache
  - rm -f modcache_e2e.tar.zst
  - mkdir -p ~/.pulumi && zstd -dc pulumi_plugins.tar.zst | tar xf - -C ~/.pulumi
  - rm -f pulumi_plugins.tar.zst
  - mkdir -p $GOPATH/bin $GOPATH/pkg/mod/cache && zstd -dc modcache_tools.tar.zst
    | tar xf - -C $GOPATH
  - rm -f modcache_tools.tar.zst
  - export PATH=$PATH:$GOPATH/bin
  - mkdir -p ~/.aws
  - "if [ -n \"$E2E_USE_AWS_PROFILE\" ]; then\n  echo Using agent-qa-ci aws profile\n\
    \  $CI_PROJECT_DIR/tools/ci/fetch_secret.sh $AGENT_QA_E2E profile >> ~/.aws/config\
    \ || exit $?\n  # Now all `aws` commands target the agent-qa profile\n  export\
    \ AWS_PROFILE=agent-qa-ci\nelse\n  # Assume role to fetch only once credentials\
    \ and avoid rate limits\n  echo Assuming ddbuild-agent-ci role\n  roleoutput=\"\
    $(aws sts assume-role --role-arn arn:aws:iam::669783387624:role/ddbuild-agent-ci\
    \ --external-id ddbuild-agent-ci --role-session-name RoleSession)\"\n  export\
    \ AWS_ACCESS_KEY_ID=\"$(echo \"$roleoutput\" | jq -r '.Credentials.AccessKeyId')\"\
    \n  export AWS_SECRET_ACCESS_KEY=\"$(echo \"$roleoutput\" | jq -r '.Credentials.SecretAccessKey')\"\
    \n  export AWS_SESSION_TOKEN=\"$(echo \"$roleoutput\" | jq -r '.Credentials.SessionToken')\"\
    \nfi\n"
  - $CI_PROJECT_DIR/tools/ci/fetch_secret.sh $AGENT_QA_E2E ssh_public_key_rsa > $E2E_AWS_PUBLIC_KEY_PATH
    || exit $?
  - touch $E2E_AWS_PRIVATE_KEY_PATH && chmod 600 $E2E_AWS_PRIVATE_KEY_PATH && $CI_PROJECT_DIR/tools/ci/fetch_secret.sh
    $AGENT_QA_E2E ssh_key_rsa > $E2E_AWS_PRIVATE_KEY_PATH || exit $?
  - $CI_PROJECT_DIR/tools/ci/fetch_secret.sh $AGENT_QA_E2E ssh_public_key_rsa > $E2E_AZURE_PUBLIC_KEY_PATH
    || exit $?
  - touch $E2E_AZURE_PRIVATE_KEY_PATH && chmod 600 $E2E_AZURE_PRIVATE_KEY_PATH &&
    $CI_PROJECT_DIR/tools/ci/fetch_secret.sh $AGENT_QA_E2E ssh_key_rsa > $E2E_AZURE_PRIVATE_KEY_PATH
    || exit $?
  - $CI_PROJECT_DIR/tools/ci/fetch_secret.sh $AGENT_QA_E2E ssh_public_key_rsa > $E2E_GCP_PUBLIC_KEY_PATH
    || exit $?
  - touch $E2E_GCP_PRIVATE_KEY_PATH && chmod 600 $E2E_GCP_PRIVATE_KEY_PATH && $CI_PROJECT_DIR/tools/ci/fetch_secret.sh
    $AGENT_QA_E2E ssh_key_rsa > $E2E_GCP_PRIVATE_KEY_PATH || exit $?
  - pulumi login "s3://dd-pulumi-state?region=us-east-1&awssdk=v2&profile=$AWS_PROFILE"
  - ARM_CLIENT_ID=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $E2E_AZURE client_id)
    || exit $?; export ARM_CLIENT_ID
  - ARM_CLIENT_SECRET=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $E2E_AZURE token)
    || exit $?; export ARM_CLIENT_SECRET
  - ARM_TENANT_ID=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $E2E_AZURE tenant_id)
    || exit $?; export ARM_TENANT_ID
  - ARM_SUBSCRIPTION_ID=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $E2E_AZURE subscription_id)
    || exit $?; export ARM_SUBSCRIPTION_ID
  - $CI_PROJECT_DIR/tools/ci/fetch_secret.sh $E2E_GCP credentials_json > ~/gcp-credentials.json
    || exit $?
  - export GOOGLE_APPLICATION_CREDENTIALS=~/gcp-credentials.json
  - 'gcp_acr_key=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $E2E_GCP credentials_acr_readonly)
    || exit $?

    export E2E_GCP_IMAGE_PULL_PASSWORD="b64=$(printf ''%s'' "$gcp_acr_key" | base64
    -w 0)"

    '
  - dda inv -- -e gitlab.generate-ci-visibility-links --output=$EXTERNAL_LINKS_PATH
  - export DD_ENV=nativetest
  - export DD_CIVISIBILITY_ENABLED=true
  - export DD_CIVISIBILITY_AGENTLESS_ENABLED=true
  - export DD_CIVISIBILITY_FLAKY_RETRY_ENABLED=false
  - export DD_TAGS="gitlab.pipeline_source:${CI_PIPELINE_SOURCE}"
  - DD_API_KEY=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $AGENT_API_KEY_ORG2 token)
    || exit $?; export DD_API_KEY
  - export WINDOWS_DDNPM_DRIVER=${WINDOWS_DDNPM_DRIVER:-$(dda inv release.get-release-json-value
    "dependencies::WINDOWS_DDNPM_DRIVER" --no-worktree)}
  - export WINDOWS_DDPROCMON_DRIVER=${WINDOWS_DDPROCMON_DRIVER:-$(dda inv release.get-release-json-value
    "dependencies::WINDOWS_DDPROCMON_DRIVER" --no-worktree)}
  image: registry.ddbuild.io/ci/datadog-agent-buildimages/linux$CI_IMAGE_LINUX_SUFFIX:$CI_IMAGE_LINUX
  needs:
  - go_e2e_deps
  - artifacts: false
    job: go_e2e_test_binaries
  - go_tools_deps
  - job: new-e2e-base-coverage
    optional: true
  - qa_agent_linux
  - qa_agent_linux_jmx
  - qa_dca
  - qa_dogstatsd
  retry:
    exit_codes:
    - 42
    max: 2
    when:
    - runner_system_failure
    - stuck_or_timeout_failure
    - unknown_failure
    - api_failure
    - scheduler_failure
    - stale_schedule
    - data_integrity_failure
  rules:
  - if: $RUN_E2E_TESTS == "off"
    when: never
  - if: $CI_COMMIT_BRANCH =~ /^mq-working-branch-/
    when: never
  - if: $RUN_E2E_TESTS == "on"
    when: on_success
  - if: $CI_COMMIT_BRANCH == "main"
    when: on_success
  - if: $CI_COMMIT_BRANCH =~ /^[0-9]+\.[0-9]+\.x$/
    when: on_success
  - if: $CI_COMMIT_TAG =~ /^[0-9]+\.[0-9]+\.[0-9]+-rc\.[0-9]+$/
    when: on_success
  - changes:
      compare_to: $COMPARE_TO_BRANCH
      paths:
      - .gitlab/test/e2e/e2e.yml
      - test/e2e-framework/**/*
      - test/new-e2e/go.mod
      - flakes.yaml
      - release.json
  - changes:
      compare_to: $COMPARE_TO_BRANCH
      paths:
      - comp/core/tagger/**/*
      - comp/core/workloadmeta/**/*
      - comp/core/autodiscovery/listeners/**/*
      - comp/core/autodiscovery/providers/**/*
      - comp/forwarder/defaultforwarder/**/*
      - comp/serializer/**/*
      - pkg/aggregator/*/**
      - comp/languagedetection/**/*
      - pkg/clusteragent/admission/mutate/**/*
      - pkg/clusteragent/languagedetection/**/*
      - pkg/collector/corechecks/cluster/**/*
      - pkg/collector/corechecks/containers/**/*
      - pkg/collector/corechecks/containerimage/**/*
      - pkg/collector/corechecks/containerlifecycle/**/*
      - pkg/collector/corechecks/sbom/**/*
      - pkg/sbom/**/*
      - pkg/util/clusteragent/**/*
      - pkg/util/containerd/**/*
      - pkg/util/containers/**/*
      - pkg/util/docker/**/*
      - pkg/util/ecs/**/*
      - pkg/util/kubernetes/**/*
      - pkg/util/cgroups/**/*
      - pkg/util/trivy/**/*
      - test/new-e2e/tests/containers/**/*
      - test/new-e2e/go.mod
    when: on_success
  - if: $CI_COMMIT_BRANCH =~ /^mq-working-branch-/
    when: never
  - allow_failure: true
    when: manual
  script:
  - export IS_DEV_BRANCH="$(dda inv -- -e pipeline.is-dev-branch)"
  - DYNAMIC_TESTS_BREAKGLASS=$($CI_PROJECT_DIR/tools/ci/fetch_secret.sh $DYNAMIC_TESTS_BREAKGLASS
    value) || exit $?; export DYNAMIC_TESTS_BREAKGLASS
  - "if [ \"$DYNAMIC_TESTS_BREAKGLASS\" == \"true\" ] || [ \"$IS_DEV_BRANCH\" == \"\
    false\" ] || [ \"$RUN_E2E_TESTS\" == \"on\" ]; then\n  export DYNAMIC_TESTS_FLAG=\"\
    \"\nfi\n"
  - export E2E_IMAGE_PULL_PASSWORD=$(aws ecr get-login-password),$E2E_GCP_IMAGE_PULL_PASSWORD
  - dda inv -- -e new-e2e-tests.run $DYNAMIC_TESTS_FLAG $PRE_BUILT_BINARIES_FLAG $MAX_RETRIES_FLAG
    --local-package $CI_PROJECT_DIR/$OMNIBUS_BASE_DIR --result-json $E2E_RESULT_JSON
    --targets $TARGETS --junit-tar junit-${CI_JOB_ID}.tgz ${EXTRA_PARAMS} --test-washer
    --logs-folder=$E2E_OUTPUT_DIR/logs --logs-post-processing --logs-post-processing-test-depth=$E2E_LOGS_PROCESSING_TEST_DEPTH
  stage: e2e
  tags:
  - arch:amd64
  - specific:true
  variables:
    DYNAMIC_TESTS_FLAG: --impacted
    E2E_AWS_PRIVATE_KEY_PATH: /tmp/agent-qa-aws-ssh-key
    E2E_AWS_PUBLIC_KEY_PATH: /tmp/agent-qa-aws-ssh-key.pub
    E2E_AZURE_PRIVATE_KEY_PATH: /tmp/agent-qa-azure-ssh-key
    E2E_AZURE_PUBLIC_KEY_PATH: /tmp/agent-qa-azure-ssh-key.pub
    E2E_COMMIT_SHA: $CI_COMMIT_SHORT_SHA
    E2E_COVERAGE_OUT_DIR: $CI_PROJECT_DIR/coverage
    E2E_GCP_PRIVATE_KEY_PATH: /tmp/agent-qa-gcp-ssh-key
    E2E_GCP_PUBLIC_KEY_PATH: /tmp/agent-qa-gcp-ssh-key.pub
    E2E_IMAGE_PULL_REGISTRY: 669783387624.dkr.ecr.us-east-1.amazonaws.com,us-central1-docker.pkg.dev
    E2E_IMAGE_PULL_USERNAME: AWS,_json_key
    E2E_KEY_PAIR_NAME: datadog-agent-ci-rsa
    E2E_LOGS_PROCESSING_TEST_DEPTH: 1
    E2E_OUTPUT_DIR: $CI_PROJECT_DIR/e2e-output
    E2E_PIPELINE_ID: $CI_PIPELINE_ID
    E2E_PREBUILD_S3_URI: $S3_PERMANENT_ARTIFACTS_URI/e2e-pre-build/$CI_PIPELINE_ID
    E2E_RESULT_JSON: $CI_PROJECT_DIR/e2e_test_output.json
    E2E_SKIP_WINDOWS: $SKIP_WINDOWS
    E2E_USE_AWS_PROFILE: 'true'
    EXTERNAL_LINKS_PATH: external_links_$CI_JOB_ID.json
    FLAKY_PATTERNS_CONFIG: $CI_PROJECT_DIR/flaky-patterns-runtime.yaml
    GIT_STRATEGY: clone
    KUBERNETES_CPU_REQUEST: 6
    KUBERNETES_MEMORY_LIMIT: 16Gi
    KUBERNETES_MEMORY_REQUEST: 12Gi
    MAX_RETRIES_FLAG: ''
    ON_NIGHTLY_FIPS: 'true'
    PRE_BUILT_BINARIES_FLAG: --use-prebuilt-binaries
    REMOTE_STACK_CLEANING: 'true'
    SHOULD_RUN_IN_FLAKES_FINDER: 'true'
    TARGETS: ./tests/ecs
    TEAM: ecs-experiences

Removed Jobs

new-e2e-containers-ecs

Changes Summary

Removed	Modified	Added	Renamed
1	0	1	0

ℹ️ Diff available in the job log.

…ify reflection panic Testify's suite runner uses reflection to discover and run test methods by looking for methods that start with "Test". The helper methods in BaseSuite were incorrectly named with "Test" prefix, causing testify to try running them as test methods. Since these helpers expect arguments (e.g., *TestMetricArgs), reflect panicked with "Call with too few input arguments". Renamed 8 helper methods: - TestMetric → AssertMetric - TestLog → AssertLog - TestCheckRun → AssertCheckRun - TestEvent → AssertEvent - TestAPMTrace → AssertAPMTrace - TestLogPipeline → AssertLogPipeline - TestAgentHealth → AssertAgentHealth - TestResilienceScenario → AssertResilienceScenario Updated all call sites in: - apm_test.go - checks_test.go - logs_test.go - managed_test.go - platform_test.go This fixes the panic that was causing 9 tests to fail in TestECSChecksSuite.

This fixes the race condition where ECS services (nginx-ec2, redis-ec2) timeout in PENDING state because they are created before EC2 container instances have registered with the cluster. Root cause: - AutoScaling Group creates 1 t3.medium instance (takes 1-2 minutes) - Instance must launch, run userdata, and register with ECS cluster - EC2 services are created immediately and try to place tasks - Services timeout after 20 minutes waiting for available capacity Solution: - Added WaitForContainerInstances() function that polls ECS API - Waits up to 5 minutes for at least 1 ACTIVE container instance - Integrated into run.go before creating testing workloads - Only runs when EC2 node groups are configured Changes: - test/e2e-framework/resources/aws/ecs/wait.go (new file) - WaitForContainerInstances() using AWS SDK v2 - Polls every 10 seconds with 5 minute timeout - Pulumi-integrated via ApplyT for proper dependency ordering - test/e2e-framework/scenarios/aws/ecs/run.go - Added wait call before testing workloads (line 101-106) - Checks if any node group type is enabled - Ensures proper sequencing: cluster → instances → services This prevents the 20-minute timeout failures seen in TestECSConfigSuite and TestECSChecksSuite when nginx and redis services can't be placed.

…ings The actual root cause was insufficient capacity for services using static host port mappings with bridge networking. Problem: - nginx and redis services both have DesiredCount=2 - Both use bridge networking with static host ports (80 and 6379) - With bridge networking, only ONE task per host port per instance - AutoScaling Group was configured with min=1, desired=1, max=2 - Only 1 instance available, but services need 2 instances total Result: - First nginx task: placed on instance 1, binds to port 80 - Second nginx task: CAN'T be placed (port 80 already in use) - First redis task: placed on instance 1, binds to port 6379 - Second redis task: CAN'T be placed (port 6379 already in use) - Services stay in PENDING state for 20 minutes, then timeout Solution: - Increased AutoScaling Group from (min=1, desired=1, max=2) to (min=2, desired=2, max=4) - Updated WaitForContainerInstances to wait for 2 instances This ensures enough capacity for all tasks with static port mappings: - 2 nginx tasks need 2 instances (1 task per instance on port 80) - 2 redis tasks need 2 instances (1 task per instance on port 6379) - Plus additional services (cpustress, dogstatsd, prometheus, tracegen) Changes: - test/e2e-framework/resources/aws/ecs/nodeGroups.go: ASG size 2->2->4 - test/e2e-framework/scenarios/aws/ecs/run.go: wait for 2 instances

Add DD_APM_DD_URL configuration to ecsFakeintakeAdditionalEndpointsEnv() to ensure APM traces are sent to FakeIntake for validation. This fixes 17 APM tests that were failing with 'No traces found' errors. The agent was already configured to accept traces (DD_APM_ENABLED=true) but was missing the endpoint configuration to send them to FakeIntake.

Resolve merge conflicts: - CODEOWNERS: keep new /tests/ecs/ ownership + new cluster_agent_sgc entry from main - e2e.yml: keep new-e2e-ecs job + openshift comment from main - go.mod: drop removed gopkg.in/ini.v1 dependency (removed in main)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ba4d5574ed

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-09T17:10:49Z

+
+	// Managed instances use similar configuration but with ECS-managed ASG
+	// For testing purposes, we create a standard node group that ECS will manage
+	return newNodeGroup(e, "managed-ng", pulumi.String(ecsAmi.Value), pulumi.String(e.DefaultInstanceType()), getUserData(linuxInitUserData, clusterName))


Use ECS-managed capacity for managed-node tests

NewManagedNodeGroup is documented as provisioning ECS-managed instances, but it currently delegates to newNodeGroup, which creates the same ASG-backed capacity-provider path used by regular EC2 node groups. That means the new managed-instance suite is not actually exercising managed-instance behavior (scaling/lifecycle semantics), so it can pass while regressions specific to real ECS-managed instances remain undetected.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-09T17:10:49Z

+	}
+
 	// Testing workload if at least one EC2 node group is present
 	if params.testingWorkload && isEC2ProviderSet(clusterParams) {


Treat managed node groups as EC2 providers

This testingWorkload gate depends on isEC2ProviderSet, but that helper does not include ManagedInstanceNodeGroup. As a result, when a stack enables only WithManagedInstanceNodeGroup(), the scenario skips EC2 workload deployment and the container-instance readiness wait path even though managed instances are EC2-backed capacity. This makes managed-mode runs behave differently from intended and weakens coverage for the new managed suite.

Useful? React with 👍 / 👎.

L3n41c · 2026-04-09T21:19:59Z

+		ctx := context.Background()
+		cfg, err := awsconfig.LoadDefaultConfig(ctx,


Use Pulumi context instead of context.Background().

Suggested change

ctx := context.Background()

cfg, err := awsconfig.LoadDefaultConfig(ctx,

cfg, err := awsconfig.LoadDefaultConfig(e.Ctx().Context(),

L3n41c · 2026-04-09T21:20:18Z

+				return "", fmt.Errorf("timeout waiting for container instances after %v", maxWaitTime)
+			}
+
+			listOutput, err := ecsClient.ListContainerInstances(ctx, &ecs.ListContainerInstancesInput{


Suggested change

listOutput, err := ecsClient.ListContainerInstances(ctx, &ecs.ListContainerInstancesInput{

listOutput, err := ecsClient.ListContainerInstances(e.Ctx().Context(), &ecs.ListContainerInstancesInput{

L3n41c · 2026-04-09T21:26:24Z

+			ecs.TaskDefinitionKeyValuePairArgs{
+				Name:  pulumi.StringPtr("DD_LOG_LEVEL"),
+				Value: pulumi.StringPtr("debug"),
+			},


Just a warning that, in the past, being too verbose led to issues with the fake intake like OOM kills or degraded response time.
Indeed, the filtering is done on the client side of the fake intake so that, any single-line log assertion leads to a download from the fake intake server of all the debug logs that it received from the agents.

L3n41c · 2026-04-09T21:29:20Z

I think the release notes are meant to be “customer centric” and they should contain only changes that affect what we deliver to the customers.
It’s basically a way to communicate to our customers what they should expect if they upgrade the version of their agent.
So, I’m not sure it’s worth having a release note for tests.

L3n41c · 2026-04-09T21:41:50Z

+				scenecs.WithFargateCapacityProvider(),
+				scenecs.WithLinuxNodeGroup(),
+				scenecs.WithLinuxBottleRocketNodeGroup(),
+				scenecs.WithWindowsNodeGroup(),


The pull request #46377 made the Windows parts of e2e tests opt-in: https://github.com/DataDog/datadog-agent/pull/46377/changes#diff-9543eb0592b402752226e7a4506f6abebbe90e62fa58b20b5603451e042d4892

The goal was to not have to wait for the long-running Windows jobs to finish building the Windows agent before starting the e2e tests for the majority of pull requests that don’t touch Windows-specific pieces of code.

This was a mitigation to the issue that Windows jobs are lasting significantly more time than the Linux ones. It allowed to have a quicker CI feedback as the e2e tests jobs could start earlier.

Unconditionally requiring Windows will be a regression in the average whole CI pipeline execution time.

L3n41c · 2026-04-09T21:51:17Z

+      - EXTRA_PARAMS: --run TestECSAPMSuite
+      - EXTRA_PARAMS: --run TestECSManagedSuite
+      - EXTRA_PARAMS: --run TestECSChecksSuite
+      - EXTRA_PARAMS: --run TestECSPlatformSuite


What’s the most expensive in e2e tests is the infra management.
The test execution is usually extremely fast.

If we run the tests sequentially, they would last:
20 minutes (spawn an ECS cluster, pull the datadog-agent images on it, deploy fake apps)
+ 10 seconds for TestECSAPMSuite
+ 10 seconds for TestECSManagedSuite
+ 10 seconds for TestECSChecksSuite
+ 10 seconds for TestECSPlatformSuite
= 20 minutes 40 seconds.

If we parallelize the tests, they would last:
max of (
20 minutes (spawn of infra) + 10 seconds for TestECSAPMSuite,
20 minutes (spawn of infra) + 10 seconds for TestECSManagedSuite,
20 minutes (spawn of infra) + 10 seconds for TestECSChecksSuite,
20 minutes (spawn of infra) + 10 seconds for TestECSPlatformSuite,
) = 20 minutes 10 seconds.

So, by parallelizing the tests like this, we will save 30 seconds of elapsed time over a total duration of 20 minutes at the cost of being 4 times more expensive as it will require to have 4 times more ECS clusters.

For the exact figures about how long a single test lasts compared to whole suite with the infra management, see: https://app.datadoghq.com/ci/test/runs?query=test_level%3Atest%20%40test.service%3Adatadog-agent%20%40git.branch%3Ajlineaweaver%2Fexp-133%20%40test.suite%3A%22github.com%2FDataDog%2Fdatadog-agent%2Ftest%2Fnew-e2e%2Ftests%2Fecs%22&agg_m=count&agg_m_source=base&agg_t=count&citest_explorer_sort=time%2Cdesc&currentTab=overview&eventStack=&fromUser=true&index=citest&mode=sliding&refresh_mode=sliding&saved-view-id=2236559&start=1773180070024&end=1775772070024&paused=false

- Use Pulumi context instead of context.Background() in wait.go - Remove DD_LOG_LEVEL=debug to prevent fakeintake OOM issues - Remove release note (not customer-facing) - Remove WithWindowsNodeGroup() to avoid CI regression (PR #46377) - Remove CI parallel matrix (4x infra cost for ~30s savings) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The aspnetsample Fargate app deployment is gated on WindowsNodeGroup being set. Since we removed WithWindowsNodeGroup() to avoid CI regression, this test has no workload to validate. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

L3n41c · 2026-04-13T13:30:59Z

+  needs:
+    - !reference [.needs_new_e2e_template]
+    - qa_agent
+    - qa_agent_jmx
+    - qa_dca
+    - qa_dogstatsd


These jobs build a multi-arch image, with Windows.
As Windows build take ages, it would be beneficial to depend on Linux-only images if Windows tests are disabled.
I think this done by using

extends: - .new_e2e_template_needs_container_deploy_linux

instead of

Suggested change

needs:

- !reference [.needs_new_e2e_template]

- qa_agent

- qa_agent_jmx

- qa_dca

- qa_dogstatsd

L3n41c · 2026-04-13T13:40:29Z

This new-e2e-containers-ecs job is most probably not needed anymore. TestECSSuite was defined in ecs_test.go which has been removed. This job will run nothing anymore.

Suggested change

L3n41c · 2026-04-13T14:04:34Z

+func NewManagedNodeGroup(e aws.Environment, clusterName pulumi.StringInput) (pulumi.StringOutput, error) {
+	// Use the same ECS-optimized AMI as regular node groups
+	ecsAmi, err := ssm.LookupParameter(e.Ctx(), &ssm.LookupParameterArgs{
+		Name: "/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id",
+	}, e.WithProvider(config.ProviderAWS))
+	if err != nil {
+		return pulumi.StringOutput{}, err
+	}
+
+	// Managed instances use similar configuration but with ECS-managed ASG
+	// For testing purposes, we create a standard node group that ECS will manage
+	return newNodeGroup(e, "managed-ng", pulumi.String(ecsAmi.Value), pulumi.String(e.DefaultInstanceType()), getUserData(linuxInitUserData, clusterName))


I didn’t manage to spot the differences between this function and NewECSOptimizedNodeGroup.
Would have it been possible to use NewECSOptimizedNodeGroup instead of creating this new function?

L3n41c · 2026-04-13T14:17:19Z

+
+		// Git metadata
+		regexp.MustCompile(`git\.commit\.sha:[[:xdigit:]]{40}`),
+		regexp.MustCompile(`git.repository_url:https://github.com/DataDog/test-infra-definitions`),


We miss the “begin of text”/“end of text” anchors to ensure there’s no garbage at the beginning or the end of tags.

For ex., having both:

regexp.MustCompile(`container_name:`), regexp.MustCompile(`ecs_container_name:tracegen`),

is useless because any string matching the second regex will necessarily also match the former.

If we expect to have both container_name:.* and ecs_container_name:.* tags, then the test is incorrect because absence of container_name:.* will not be caught as long as ecs_container_name:.* is present.

Suggested change

regexp.MustCompile(`git.repository_url:https://github.com/DataDog/test-infra-definitions`),

// Core ECS metadata

regexp.MustCompile(`^ecs_cluster_name:` + regexp.QuoteMeta(suite.ecsClusterName) + `$`),

regexp.MustCompile(`^task_arn:`),

regexp.MustCompile(`^container_name:`),

regexp.MustCompile(`^ecs_container_name:tracegen$`),

regexp.MustCompile(`^task_family:.*-` + regexp.QuoteMeta(taskName) + `-ec2$`),

regexp.MustCompile(`^task_name:.*-` + regexp.QuoteMeta(taskName) + `-ec2$`),

regexp.MustCompile(`^task_version:[[:digit:]]+$`),

// Image metadata

regexp.MustCompile(`^docker_image:ghcr\.io/datadog/apps-tracegen:` + regexp.QuoteMeta(apps.Version) + `$`),

regexp.MustCompile(`^image_name:ghcr\.io/datadog/apps-tracegen$`),

regexp.MustCompile(`^image_tag:` + regexp.QuoteMeta(apps.Version) + `$`),

regexp.MustCompile(`^short_image:apps-tracegen$`),

// Git metadata

regexp.MustCompile(`^git\.commit\.sha:[[:xdigit:]]{40}$`),

regexp.MustCompile(`^git.repository_url:https://github.com/DataDog/test-infra-definitions$`),

L3n41c · 2026-04-13T14:23:08Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2c861cfcb8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-13T14:28:49Z

+	if isEC2ProviderSet(clusterParams) {
+		ctx.Log.Info("Waiting for EC2 container instances to register with the cluster...", nil)
+		ec2ClusterArn = resourcesEcs.WaitForContainerInstances(awsEnv, cluster.ClusterArn, 1)
+	}
+
 	// Testing workload if at least one EC2 node group is present
 	if params.testingWorkload && isEC2ProviderSet(clusterParams) {


Include managed node groups in EC2 workload gating

WithManagedInstanceNodeGroup() never satisfies the EC2-provider checks here, so managed-instance runs skip both the container-instance readiness wait and the WithTestingWorkload() EC2 app deployment path. In practice this means managed ECS scenarios exercise less infrastructure than requested (and can become timing-flaky because workloads are scheduled before instances are known ready). Treat ManagedInstanceNodeGroup as an EC2 provider in the gating logic used by these branches.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-13T14:28:49Z

+			scenecs.WithECSOptions(
+				scenecs.WithFargateCapacityProvider(),
+				scenecs.WithLinuxNodeGroup(),
+				scenecs.WithLinuxBottleRocketNodeGroup(),
+			),


Re-enable Windows capacity in ECS platform suite

This suite provisions only Fargate + Linux + BottleRocket, so after removing the old containers/ecs_test.go suite there is no ECS test path that enables WithWindowsNodeGroup() and validates Windows workloads. That drops multi-platform ECS coverage and leaves the Windows Fargate path unexercised in CI despite this migration being the new ECS test home.

Useful? React with 👍 / 👎.

- Use Linux-only CI template to avoid waiting on Windows image builds - Remove stale new-e2e-containers-ecs job (TestECSSuite was deleted) - Replace NewManagedNodeGroup with NewECSOptimizedNodeGroup (identical) - Add ^/$ anchors to trace tag regex patterns with comma-split matching - Include ManagedInstanceNodeGroup in isEC2ProviderSet Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

WithTestingWorkload() already deploys tracegen as an EC2 workload now that ManagedInstanceNodeGroup is in isEC2ProviderSet. Remove the redundant WithWorkloadApp(tracegen...) that caused a duplicate URN. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

L3n41c · 2026-04-23T08:55:18Z

Code review

Found 4 issues:

checks_test.go is missing Test00UpAndRunning / AssertECSTasksReady, while the three other new suites all include it. Per the helper's own docstring, this is required to avoid flaky startup. (base_helpers.go says "This should be called as the first test (Test00UpAndRunning) in each suite to ensure infrastructure is ready before other tests run.")

datadog-agent/test/new-e2e/tests/ecs/checks_test.go

Lines 42 to 55 in 6cddb13

    
           func (suite *ecsChecksSuite) SetupSuite() { 
        
           	suite.BaseSuite.SetupSuite() 
        
           	suite.Fakeintake = suite.Env().FakeIntake.Client() 
        
           	suite.ecsClusterName = suite.Env().ECSCluster.ClusterName 
        
           	suite.ClusterName = suite.Env().ECSCluster.ClusterName 
        
           } 
        
           func (suite *ecsChecksSuite) TestNginxECS() { 
        
           	// `nginx` check is configured via docker labels 
        
           	// Test it is properly scheduled 
        
           	suite.AssertMetric(&TestMetricArgs{ 
        
           		Filter: TestMetricFilterArgs{ 
        
           			Name: "nginx.net.request_per_s",

README.md and the PR description advertise a TestWindowsFargate test in platform_test.go ("Windows container support on Fargate"), and the coverage matrix claims "Windows Support: Fargate Yes" — but platform_test.go defines no such test (only Test00UpAndRunning, TestCPU, TestContainerLifecycle), and no Windows node group is provisioned. Root AGENTS.md flags stale docs as a review concern: "If a PR changes behavior but doesn't update the corresponding docs, comments, or doc strings, flag it."

datadog-agent/test/new-e2e/tests/ecs/README.md

Lines 60 to 77 in 6cddb13

    
           **Tests**: 
        
           - `Test00UpAndRunning` - Infrastructure readiness check 
        
           - `TestWindowsFargate` - Windows container support on Fargate (check run + container metric tag validation) 
        
           - `TestCPU` - CPU metrics with value range validation (stress-ng workload) 
        
           - `TestContainerLifecycle` - Container lifecycle tracking (multi-container metric validation) 
        
           **Key Features Tested**: 
        
           - Windows container monitoring on Fargate 
        
           - BottleRocket node support 
        
           - CPU metric value range validation 
        
           - Multi-container lifecycle tracking 
        
           --- 
        
           ### 4. `managed_test.go` - Managed Instances (3 tests) 
        
           Tests managed instance-specific features.

datadog-agent/test/new-e2e/tests/ecs/platform_test.go

Lines 30 to 58 in 6cddb13

    
           func TestECSPlatformSuite(t *testing.T) { 
        
           	t.Parallel() 
        
           	e2e.Run(t, &ecsPlatformSuite{}, e2e.WithProvisioner(provecs.Provisioner( 
        
           		provecs.WithRunOptions( 
        
           			scenecs.WithECSOptions( 
        
           				scenecs.WithFargateCapacityProvider(), 
        
           				scenecs.WithLinuxNodeGroup(), 
        
           				scenecs.WithLinuxBottleRocketNodeGroup(), 
        
           			), 
        
           			scenecs.WithFakeIntakeOptions( 
        
           				scenfi.WithRetentionPeriod("31m"), 
        
           			), 
        
           			scenecs.WithTestingWorkload(), 
        
           		), 
        
           	))) 
        
           } 
        
           func (suite *ecsPlatformSuite) SetupSuite() { 
        
           	suite.BaseSuite.SetupSuite() 
        
           	suite.Fakeintake = suite.Env().FakeIntake.Client() 
        
           	suite.ecsClusterName = suite.Env().ECSCluster.ClusterName 
        
           	suite.ClusterName = suite.Env().ECSCluster.ClusterName 
        
           } 
        
           func (suite *ecsPlatformSuite) Test00UpAndRunning() { 
        
           	suite.AssertECSTasksReady(suite.ecsClusterName) 
        
           }

README.md instructs users to run the suites with raw go test -v -timeout 30m ./test/new-e2e/tests/ecs/ -run TestECS*Suite. Root AGENTS.md is explicit that E2E tests must be invoked via dda inv new-e2e-tests.run --targets=./tests/<area>/... — raw go test is listed under "Never run these commands directly" because the build tag wrapping is handled by the invoke task.

datadog-agent/test/new-e2e/tests/ecs/README.md

Lines 98 to 120 in 6cddb13

    
           ### Running Individual Suites 
        
           ```bash 
        
           # Run APM tests only 
        
           go test -v -timeout 30m ./test/new-e2e/tests/ecs/ -run TestECSAPMSuite 
        
           # Run checks tests only 
        
           go test -v -timeout 30m ./test/new-e2e/tests/ecs/ -run TestECSChecksSuite 
        
           # Run platform tests only 
        
           go test -v -timeout 30m ./test/new-e2e/tests/ecs/ -run TestECSPlatformSuite 
        
           # Run managed instance tests only 
        
           go test -v -timeout 30m ./test/new-e2e/tests/ecs/ -run TestECSManagedSuite 
        
           ``` 
        
           ### Running All ECS Tests 
        
           ```bash 
        
           go test -v -timeout 60m ./test/new-e2e/tests/ecs/... 
        
           ```

WithManagedInstanceNodeGroup() delegates to ecs.NewECSOptimizedNodeGroup(e, ecsCluster.Name, false) — the exact same call already used for the regular LinuxNodeGroup branch a few lines above. The managed path therefore provisions ordinary EC2 capacity rather than exercising ECS managed-instance lifecycle, so managed_test.go does not actually test managed instances. A prior review comment already asked about this and the thread does not appear resolved.

datadog-agent/test/e2e-framework/scenarios/aws/ecs/cluster.go

Lines 36 to 80 in 6cddb13

    
           } 
        
           if params.LinuxNodeGroup { 
        
           	cpName, err := ecs.NewECSOptimizedNodeGroup(e, ecsCluster.Name, false) 
        
           	if err != nil { 
        
           		return err 
        
           	} 
        
           	capacityProviders = append(capacityProviders, cpName) 
        
           } 
        
           if params.LinuxARMNodeGroup { 
        
           	cpName, err := ecs.NewECSOptimizedNodeGroup(e, ecsCluster.Name, true) 
        
           	if err != nil { 
        
           		return err 
        
           	} 
        
           	capacityProviders = append(capacityProviders, cpName) 
        
           } 
        
           if params.LinuxBottleRocketNodeGroup { 
        
           	cpName, err := ecs.NewBottlerocketNodeGroup(e, ecsCluster.Name) 
        
           	if err != nil { 
        
           		return err 
        
           	} 
        
           	capacityProviders = append(capacityProviders, cpName) 
        
           } 
        
           if params.WindowsNodeGroup { 
        
           	cpName, err := ecs.NewWindowsNodeGroup(e, ecsCluster.Name) 
        
           	if err != nil { 
        
           		return err 
        
           	} 
        
           	capacityProviders = append(capacityProviders, cpName) 
        
           } 
        
           if params.ManagedInstanceNodeGroup { 
        
           	cpName, err := ecs.NewECSOptimizedNodeGroup(e, ecsCluster.Name, false) 
        
           	if err != nil { 
        
           		return err 
        
           	} 
        
           	capacityProviders = append(capacityProviders, cpName)

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

L3n41c

Currently, the ECS e2e tests log a link to a dashboard with all the needed information to debug failing tests.
Here is an example of such logs: https://gitlab.ddbuild.io/DataDog/datadog-agent/-/jobs/1622837027#L264
This dashboard is fed with the data dual-shipped by the agent.

On the GitLab job of this PR, I couldn’t find this link: https://gitlab.ddbuild.io/DataDog/datadog-agent/-/jobs/1616404316

Also, TestECSAPMSuite and TestECSChecksSuite are both spawning similar ECS clusters in parallel. This consumes twice the needed resources. Would it be possible to execute the tests on the same ECS clusters?
Given that the time to spawn the ECS clusters dominates the test execution time by several order of magnitudes, it won’t degrade the total tests elapsed time it a visible way.
TestECSManagedSuite and TestECSPlatformSuite are spawning ECS clusters with different settings. But if we could create a single ECS cluster with the settings that allow to execute all the tests on it, it would be better.

L3n41c · 2026-04-23T09:19:41Z

+		ecs.TaskDefinitionKeyValuePairArgs{
+			Name:  pulumi.StringPtr("DD_URL"),
+			Value: fakeintake.URL.ToStringOutput(),
+		},
+		ecs.TaskDefinitionKeyValuePairArgs{
+			Name:  pulumi.StringPtr("DD_APM_DD_URL"),
+			Value: fakeintake.URL.ToStringOutput(),
+		},


Isn’t the fake intake feed already setup through DD_ADDITIONAL_ENDPOINTS below ?

datadog-agent/test/e2e-framework/components/datadog/agent/ecs.go

Lines 264 to 271 in a3950e9

ecs.TaskDefinitionKeyValuePairArgs{

Name: pulumi.StringPtr("DD_ADDITIONAL_ENDPOINTS"),

Value: pulumi.Sprintf(`{"%s": ["FAKEAPIKEY"]}`, fakeintake.URL.ToStringOutput()),

},

ecs.TaskDefinitionKeyValuePairArgs{

Name: pulumi.String("DD_LOGS_CONFIG_ADDITIONAL_ENDPOINTS"),

Value: pulumi.Sprintf(`[{"host": "%s"}]`, fakeintake.Host),

},

If we set both DD_URL and DD_ADDITIONAL_ENDPOINTS to the fake intake, won’t we have the agent feed the fake intake twice instead of dual-shipping the data to both:

dddev for investigation purpose and

the fake intake for tests assertions
?

Could we validate that the ECS e2e tests dashboard is still properly fed?

On the main branch, the ECS e2e tests logs a link to this dashboard with the template parameters and the time frame already filled like: https://gitlab.ddbuild.io/DataDog/datadog-agent/-/jobs/1622837027#L264 that links to https://dddev.datadoghq.com/dashboard/mnw-tdr-jd8/e2e-tests-containers-ecs?refresh_mode=paused&tpl_var_ecs_cluster_name%5B0%5D=ci-1622837027-4670-e2e-ecssuite-2f0369522723a4c3-ecs&tpl_var_fake_intake_task_family%5B0%5D=ci-1622837027-4670-e2e-ecssuite-2f0369522723a4c3-fakeintake-ecs&from_ts=1776933985525&to_ts=1776934377141&live=false .

But in the GitLab job of this PR, I couldn’t find the link to dashboard: https://gitlab.ddbuild.io/DataDog/datadog-agent/-/jobs/1616404316

L3n41c · 2026-04-23T09:30:21Z

+	return pulumi.All(clusterArn).ApplyT(func(args []interface{}) (string, error) {
+		clusterArnStr := args[0].(string)


pulumi.All(…) is only needed when we want to wait for several resources to be created.
Here, we only wait for clusterArn.
Why can’t we just simply do:

Suggested change

return pulumi.All(clusterArn).ApplyT(func(args []interface{}) (string, error) {

clusterArnStr := args[0].(string)

return clusterArn.ApplyT(func(clusterArnStr string) (string, error) {

That’s simpler.

L3n41c · 2026-04-23T09:48:51Z

+		PortMappings: ecs.TaskDefinitionPortMappingArray{
+			ecs.TaskDefinitionPortMappingArgs{
+				ContainerPort: pulumi.IntPtr(8080),
+				HostPort:      pulumi.IntPtr(8080),
+				Protocol:      pulumi.StringPtr("tcp"),
+			},
+		},


Just out of curiosity, why is it needed to expose the port on the host?
Shouldn’t the agent be able to connect to the application with the Pod IP?
Especially on Fargate where the agent runs as a side-car container, so in the same network namespace as the application?

L3n41c · 2026-04-23T09:50:55Z

+		// Can be replaced by require.NoErrorf(…) once https://github.com/stretchr/testify/pull/1481 is merged
+		if !assert.NoErrorf(c, cerr, "Failed to query fake intake") {
+			return
+		}


Suggested change

// Can be replaced by require.NoErrorf(…) once https://github.com/stretchr/testify/pull/1481 is merged

if !assert.NoErrorf(c, cerr, "Failed to query fake intake") {

return

}

// Can be replaced by require.NoErrorf(…) once https://github.com/stretchr/testify/pull/1481 is merged

require.NoErrorf(c, cerr, "Failed to query fake intake")

L3n41c · 2026-04-23T10:00:00Z

+
+**Tests**:
+- `Test00UpAndRunning` - Infrastructure readiness check
+- `TestWindowsFargate` - Windows container support on Fargate (check run + container metric tag validation)


I think this has been removed, hasn’t it?
I couldn’t find TestWindowsFargate in platform_test.go anymore.

Suggested change

- `TestWindowsFargate` - Windows container support on Fargate (check run + container metric tag validation)

L3n41c · 2026-04-23T10:04:24Z

+```bash
+# Run APM tests only
+go test -v -timeout 30m ./test/new-e2e/tests/ecs/ -run TestECSAPMSuite
+
+# Run checks tests only
+go test -v -timeout 30m ./test/new-e2e/tests/ecs/ -run TestECSChecksSuite
+
+# Run platform tests only
+go test -v -timeout 30m ./test/new-e2e/tests/ecs/ -run TestECSPlatformSuite
+
+# Run managed instance tests only
+go test -v -timeout 30m ./test/new-e2e/tests/ecs/ -run TestECSManagedSuite
+```


We should favor the invoke task.

Suggested change

```bash

# Run APM tests only

go test -v -timeout 30m ./test/new-e2e/tests/ecs/ -run TestECSAPMSuite

# Run checks tests only

go test -v -timeout 30m ./test/new-e2e/tests/ecs/ -run TestECSChecksSuite

# Run platform tests only

go test -v -timeout 30m ./test/new-e2e/tests/ecs/ -run TestECSPlatformSuite

# Run managed instance tests only

go test -v -timeout 30m ./test/new-e2e/tests/ecs/ -run TestECSManagedSuite

```

```bash

# Run APM tests only

dda inv new-e2e-tests.run --targets=./test/new-e2e/tests/ecs/ --run TestECSAPMSuite

# Run checks tests only

dda inv new-e2e-tests.run --targets=./test/new-e2e/tests/ecs/ --run TestECSChecksSuite

# Run platform tests only

dda inv new-e2e-tests.run --targets=./test/new-e2e/tests/ecs/ --run TestECSPlatformSuite

# Run managed instance tests only

dda inv new-e2e-tests.run --targets=./test/new-e2e/tests/ecs/ --run TestECSManagedSuite

```

L3n41c · 2026-04-23T10:07:30Z

+```bash
+go test -v -timeout 60m ./test/new-e2e/tests/ecs/...
+```


Use the invoke task instead.

Suggested change

```bash

go test -v -timeout 60m ./test/new-e2e/tests/ecs/...

```

```bash

dda inv new-e2e-tests.run --targets=./test/new-e2e/tests/ecs/...

```

github-actions Bot added long review PR is complex, plan time to review it team/agent-devx labels Dec 10, 2025

JLineaweaver added the qa/skip-qa label Dec 10, 2025

JLineaweaver added this to the 7.75.0 milestone Dec 10, 2025

JLineaweaver added 4 commits December 10, 2025 13:47

fix(e2e): Fix syntax errors in ECS test provisioner setup

05e3f4c

Fixed incorrect closing brace (}) to closing parenthesis ()) in WithECSOptions() calls in three test files: - apm_test.go - config_test.go - logs_test.go This resolves the linter failures that prevented compilation.

docs(releasenotes): Add release note for comprehensive ECS E2E testin…

e006957

…g framework Added release note documenting the new ECS E2E testing infrastructure including 7 test suites, 3 test applications, and support for all ECS deployment types (Fargate, EC2, Managed Instances).

JLineaweaver added qa/no-code-change No code change in Agent code requiring validation and removed qa/skip-qa labels Dec 10, 2025

JLineaweaver added 11 commits December 10, 2025 14:20

fix(e2e): Remove unused imports from apm_test.go

b3ddf58

JLineaweaver added 4 commits December 12, 2025 12:08

Merge branch 'main' into jlineaweaver/exp-133

ba4d557

Resolve merge conflicts: - CODEOWNERS: keep new /tests/ecs/ ownership + new cluster_agent_sgc entry from main - e2e.yml: keep new-e2e-ecs job + openshift comment from main - go.mod: drop removed gopkg.in/ini.v1 dependency (removed in main)

hmmferreira approved these changes Apr 8, 2026

View reviewed changes

dd-octo-sts Bot removed the stale label Apr 8, 2026

JLineaweaver modified the milestones: 7.78.0, 7.79.0 Apr 9, 2026

JLineaweaver marked this pull request as ready for review April 9, 2026 17:03

JLineaweaver requested review from a team as code owners April 9, 2026 17:03

chatgpt-codex-connector Bot reviewed Apr 9, 2026

View reviewed changes

justin-lesko approved these changes Apr 9, 2026

View reviewed changes

Mathew-Estafanous approved these changes Apr 9, 2026

View reviewed changes

L3n41c requested changes Apr 9, 2026

View reviewed changes

JLineaweaver added changelog/no-changelog No changelog entry needed and removed qa/no-code-change No code change in Agent code requiring validation labels Apr 10, 2026

JLineaweaver added qa/skip-qa qa/no-code-change No code change in Agent code requiring validation labels Apr 10, 2026

L3n41c reviewed Apr 13, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Apr 13, 2026

View reviewed changes

JLineaweaver requested a review from L3n41c April 13, 2026 17:58

JLineaweaver and others added 2 commits April 15, 2026 16:23

Merge branch 'main' into jlineaweaver/exp-133

6cddb13

L3n41c reviewed Apr 23, 2026

View reviewed changes

		ctx := context.Background()
		cfg, err := awsconfig.LoadDefaultConfig(ctx,

	listOutput, err := ecsClient.ListContainerInstances(ctx, &ecs.ListContainerInstancesInput{
	listOutput, err := ecsClient.ListContainerInstances(e.Ctx().Context(), &ecs.ListContainerInstancesInput{

-		regexp.MustCompile(`git.repository_url:https://github.com/DataDog/test-infra-definitions`),
+		// Core ECS metadata
+		regexp.MustCompile(`^ecs_cluster_name:` + regexp.QuoteMeta(suite.ecsClusterName) + `$`),
+		regexp.MustCompile(`^task_arn:`),
+		regexp.MustCompile(`^container_name:`),
+		regexp.MustCompile(`^ecs_container_name:tracegen$`),
+		regexp.MustCompile(`^task_family:.*-` + regexp.QuoteMeta(taskName) + `-ec2$`),
+		regexp.MustCompile(`^task_name:.*-` + regexp.QuoteMeta(taskName) + `-ec2$`),
+		regexp.MustCompile(`^task_version:[[:digit:]]+$`),
+		// Image metadata
+		regexp.MustCompile(`^docker_image:ghcr\.io/datadog/apps-tracegen:` + regexp.QuoteMeta(apps.Version) + `$`),
+		regexp.MustCompile(`^image_name:ghcr\.io/datadog/apps-tracegen$`),
+		regexp.MustCompile(`^image_tag:` + regexp.QuoteMeta(apps.Version) + `$`),
+		regexp.MustCompile(`^short_image:apps-tracegen$`),
+		// Git metadata
+		regexp.MustCompile(`^git\.commit\.sha:[[:xdigit:]]{40}$`),
+		regexp.MustCompile(`^git.repository_url:https://github.com/DataDog/test-infra-definitions$`),

	ecs.TaskDefinitionKeyValuePairArgs{
	Name: pulumi.StringPtr("DD_ADDITIONAL_ENDPOINTS"),
	Value: pulumi.Sprintf(`{"%s": ["FAKEAPIKEY"]}`, fakeintake.URL.ToStringOutput()),
	},
	ecs.TaskDefinitionKeyValuePairArgs{
	Name: pulumi.String("DD_LOGS_CONFIG_ADDITIONAL_ENDPOINTS"),
	Value: pulumi.Sprintf(`[{"host": "%s"}]`, fakeintake.Host),
	},

		return pulumi.All(clusterArn).ApplyT(func(args []interface{}) (string, error) {
		clusterArnStr := args[0].(string)

	return pulumi.All(clusterArn).ApplyT(func(args []interface{}) (string, error) {
	clusterArnStr := args[0].(string)
	return clusterArn.ApplyT(func(clusterArnStr string) (string, error) {

Conversation

JLineaweaver commented Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What Does This PR Do?

What changed

Infrastructure config

Key design decisions

Motivation

Possible Drawbacks / Trade-offs

Additional Notes

Uh oh!

agent-platform-auto-pr Bot commented Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Static quality checks

Info

Uh oh!

cit-pr-commenter Bot commented Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Regression Detector

Regression Detector Results

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Fine details of change detection per experiment

Bounds Checks: ✅ Passed

Explanation

CI Pass/Fail Decision

Uh oh!

agent-platform-auto-pr Bot commented Dec 12, 2025 • edited by dd-octo-sts Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Gitlab CI Configuration Changes

Added Jobs

Removed Jobs

Changes Summary

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

L3n41c commented Apr 13, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

L3n41c commented Apr 23, 2026

Code review

Uh oh!

L3n41c left a comment

JLineaweaver commented Dec 10, 2025 •

edited

Loading

agent-platform-auto-pr Bot commented Dec 10, 2025 •

edited

Loading

cit-pr-commenter Bot commented Dec 10, 2025 •

edited

Loading

agent-platform-auto-pr Bot commented Dec 12, 2025 •

edited by dd-octo-sts Bot

Loading